Abstract

VLSI communication networks are wire limited. The cost of a network is not a function of the number
of switches required, but rather a function of the wiring density required to construct the network. This
paper analyzes communication networks of varying dimension under the assumption of constant wire
bisection. Expressions for the latency, average case throughput, and hot-spot throughput of k-ary
n-cube networks with constant bisection are derived that agree closely with experimental measurements.
It is shown that low-dimensional networks (e.g., tori) have lower latency and higher hot-spot throughput
than high-dimensional networks (e.g., binary n-cubes) with the same bisection width.
1 Introduction


The critical component of a concurrent computer is its communication network. Many algorithms
are communication rather than processing limited. Fine-grain concurrent programs execute as few
as 10 instructions in response to a message [5]. To execute such programs efficiently, the
communication network must have a latency no greater than about 10 instruction times, and a
throughput sufficient to permit a large fraction of the nodes to transmit simultaneously.
Low-latency communication is also critical to support code sharing and garbage collection
across nodes.

¹ The research described in this paper was supported in part by the Defense Advanced Research Projects
Agency under contracts N00014-80-C-0622 and N00014-85-K-0124 and in part by a National Science Foundation
Presidential Young Investigator Award with matching funds from General Electric Corporation.

² A preliminary version of this paper appeared in the proceedings of the 1987 Stanford Conference on Advanced
Research in VLSI [8].
As the grain size of concurrent computers continues to decrease, communication latency becomes
a more important factor. The diameter of the machine grows, messages are sent more frequently,
and fewer instructions are executed in response to each message. Low latency is more difficult
to achieve in a fine-grain machine because the available wiring space grows more slowly than
the expected traffic. Since the machine must be constructed in three dimensions, the bisection
area grows only as $N^{2/3}$ while traffic grows at least as fast as $N$, the number of nodes.

VLSI  systems  are  wire  limited.  The  cost  of  these  systems  is  predominantly  that  of  connecting 
devices,  and  the  performance  is  limited  by  the  delay  of  these  interconnections.  Thus,  to  achieve 
the  required  performance,  the  network  must  make  efficient  use  of  the  available  wire.  The 
topology  of  the  network  must  map  into  the  three  physical  dimensions  so  that  messages  are 
not  required  to  double  back  on  themselves,  and  in  a  way  that  allows  messages  to  use  all  of  the 
available  bandwidth  along  their  path. 

This paper considers the problem of constructing wire-efficient communication networks, networks
that give the optimum performance for a given wire density. We compare networks holding wire
bisection, the number of wires crossing a cut that evenly divides the machine, constant. Thus we
compare low-dimensional networks with wide communication channels against high-dimensional
networks with narrow channels. We investigate the class of k-ary n-cube interconnection networks
and show that low-dimensional networks outperform high-dimensional networks with the same
bisection width.

The remainder of this paper describes the design of wire-efficient communication networks.
Section 2 describes the assumptions on which this paper is based. The family of k-ary n-cube
networks is described in Section 2.1. We restrict our attention to k-ary n-cubes because it is the
dimension of the network that is important, not the details of its topology. Section 2.2 introduces
wormhole routing [18], a low-latency routing technique. Network cost is determined primarily
by wire density, which we will measure in terms of bisection width. Section 2.3 introduces the
idea of bisection width, and discusses delay models for network channels. A performance model
of these networks is derived in Section 3. Expressions are given for network latency as a function
of traffic that agree closely with experimental results. Under the assumption of constant wire
density, it is shown that low-dimensional networks achieve lower latency and better hot-spot
throughput than do high-dimensional networks.

2  Preliminaries 

2.1 k-ary n-cubes

Many different network topologies have been proposed for use in concurrent computers: trees
[4] [13] [19], Benes networks [3], Batcher sorting networks [1], shuffle exchange networks [21],
Omega networks [12], indirect binary n-cube or flip networks [2] [20], and direct binary n-cubes
[17] [15] [22]. The binary n-cube is a special case of the family of k-ary n-cubes, cubes with n
dimensions and k nodes in each dimension.

Figure 1: A Binary 6-Cube Embedded in the Plane

Most concurrent computers have been built using networks that are either k-ary n-cubes or
are isomorphic to k-ary n-cubes: rings, meshes, tori, direct and indirect binary n-cubes, and
Omega networks. Thus, in this paper we restrict our attention to k-ary n-cube networks. We
refer to n as the dimension of the cube and k as the radix. Dimension, radix, and number of
nodes are related by the equation

$$N = k^n, \qquad \left(k = \sqrt[n]{N},\; n = \log_k N\right). \tag{1}$$

It  is  the  dimension  of  the  network  that  is  important,  not  the  details  of  its  topology. 

A node in a k-ary n-cube can be identified by an n-digit radix k address, $a_0, \ldots, a_{n-1}$. The
$i$th digit of the address, $a_i$, represents the node's position in the $i$th dimension. Each node
can forward messages to its upper neighbor in each dimension, $i$, with address
$a_0, \ldots, a_i + 1 \pmod{k}, \ldots, a_{n-1}$.
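The addressing rule can be illustrated with a minimal Python sketch. The function names `upper_neighbor` and `node_number` are hypothetical and chosen only for this illustration, and packing the digits with $a_0$ least significant is an assumption rather than something the paper specifies.

```python
def upper_neighbor(address, dim, k):
    """Return the address one hop "up" in dimension `dim`.

    `address` is a tuple of n radix-k digits (a_0, ..., a_{n-1}).
    """
    digits = list(address)
    digits[dim] = (digits[dim] + 1) % k  # a_i + 1 (mod k)
    return tuple(digits)


def node_number(address, k):
    """Pack the digits into one integer (assumption: a_0 is least significant)."""
    return sum(digit * k**i for i, digit in enumerate(address))


# In an 8-ary 2-cube, node (7, 3) wraps around to (0, 3) in dimension 0.
print(upper_neighbor((7, 3), 0, 8))
```

Because every hop adds one modulo k, following the upper neighbor k times in the same dimension returns to the starting node, which is the ring structure of each dimension.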

In this paper we assume, for simplicity, that our k-ary n-cubes are unidirectional. We will see
that our results do not change appreciably for bidirectional networks. For an actual machine,
however, there are many compelling reasons to make our networks bidirectional. Most importantly,
bidirectional networks allow us to exploit locality of communication. If an object, A,
sends a message to an object, B, there is a high probability of B sending a message back to A.
In a bidirectional network, a round trip from A to B can be made short by placing A and B
close together. In a unidirectional network, a round trip will always involve completely circling
the machine in at least one dimension.

Figures 1-3 show three k-ary n-cube networks in order of decreasing dimension. Figure 1
shows a binary 6-cube (64 nodes). A 3-ary 4-cube (81 nodes) is shown in Figure 2. An 8-ary
2-cube (64 nodes), or torus, is shown in Figure 3. Each line in Figure 1 represents two
communication channels, one in each direction, while each line in Figures 2 and 3 represents a
single communication channel.


Figure 2: A Ternary 4-Cube Embedded in the Plane
Figure 4: Latency of store-and-forward routing (top) vs. wormhole routing (bottom).

2.2  Wormhole  Routing 

In this paper we consider networks that use wormhole [18] rather than store-and-forward [23]
routing. Instead of storing a packet completely in a node and then transmitting it to the next
node, wormhole routing operates by advancing the head of a packet directly from incoming to
outgoing channels. Only a few flow control digits (flits) are buffered at each node. A flit is the
smallest unit of information that a queue or channel can accept or refuse.

As  soon  as  a  node  examines  the  header  flit(s)  of  a  message,  it  selects  the  next  channel  on 
the  route  and  begins  forwarding  flits  down  that  channel.  As  flits  are  forwarded,  the  message 
becomes spread out across the channels between the source and destination. It is possible for
the  first  flit  of  a  message  to  arrive  at  the  destination  node  before  the  last  flit  of  the  message 
has  left  the  source.  Because  most  flits  contain  no  routing  information,  the  flits  in  a  message 
must  remain  in  contiguous  channels  of  the  network  and  cannot  be  interleaved  with  the  flits  of 
other  messages.  When  the  header  flit  of  a  message  is  blocked,  all  of  the  flits  of  a  message  stop 
advancing  and  block  the  progress  of  any  other  message  requiring  the  channels  they  occupy. 

A  method  similar  to  wormhole  routing,  called  virtual  cut-through,  is  described  in  [11].  Virtual 
cut-through  differs  from  wormhole  routing  in  that  it  buffers  messages  when  they  block,  removing 
them  from  the  network.  With  wormhole  routing,  blocked  messages  remain  in  the  network. 
Figure 4 illustrates the advantage of wormhole routing. There are two components of latency,
distance and message aspect ratio. The distance, $D$, is the number of hops required to get from
the source to the destination. The message aspect ratio (message length, $L$, normalized to the
channel width, $W$) is the number of channel cycles required to transmit the message across one
channel. The top half of the figure shows store-and-forward routing. The message is entirely
transmitted from node $N_0$ to node $N_1$, then from $N_1$ to $N_2$, and so on. With store-and-forward
routing, latency is the product of $D$ and $L/W$.

$$T_{SF} = T_c\left(D \times \frac{L}{W}\right). \tag{2}$$

The bottom half of Figure 4 shows wormhole routing. As soon as a flit arrives at a node, it is
forwarded to the next node. With wormhole routing, latency is reduced to the sum of $D$ and
$L/W$.

$$T_{WH} = T_c\left(D + \frac{L}{W}\right). \tag{3}$$

In  both  of  these  equations,  Tc  is  the  channel  cycle  time,  the  amount  of  time  required  to  perform 
a  transaction  on  a  channel. 
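As a rough numerical illustration of equations (2) and (3), the following Python sketch evaluates both latencies for one message. The function names and the sample values (10 hops, a 150-bit message, 8-bit channels, unit cycle time) are assumptions made only for this example.

```python
def store_and_forward_latency(distance, length, width, cycle_time=1.0):
    """Equation (2): the whole message is retransmitted at every hop."""
    return cycle_time * distance * (length / width)


def wormhole_latency(distance, length, width, cycle_time=1.0):
    """Equation (3): the head advances hop by hop and the body pipelines behind it."""
    return cycle_time * (distance + length / width)


# Hypothetical example: D = 10 hops, L = 150 bits, W = 8 bits per channel.
print(store_and_forward_latency(10, 150, 8))  # product of D and L/W
print(wormhole_latency(10, 150, 8))           # sum of D and L/W
```

The gap between the two grows with both distance and aspect ratio, which is why the rest of the analysis assumes wormhole routing.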


2.3  VLSI  Complexity 

VLSI  computing  systems  [14]  are  wire-limited;  the  complexity  of  what  can  be  constructed  is 
limited  by  wire  density,  the  speed  at  which  a  machine  can  run  is  limited  by  wire  delay,  and 
the  majority  of  power  consumed  by  a  machine  is  used  to  drive  wires.  Thus,  machines  must 
be  organized  both  logically  and  physically  to  keep  wires  short  by  exploiting  locality  wherever 
possible.  The  VLSI  architect  must  organize  a  computing  system  so  that  its  form  (physical 
organization)  fits  its  function  (logical  organization). 

Networks have traditionally been analyzed under the assumption of constant channel bandwidth.
Under this assumption each channel is one bit wide ($W = 1$) and has unit delay
($T_c = 1$). The constant bandwidth assumption favors networks with high dimensionality (e.g.,
binary n-cubes) over low-dimensional networks (e.g., tori). This assumption, however, is not
consistent with the properties of VLSI technology. Networks with many dimensions require
more and longer wires than do low-dimensional networks. Thus, high-dimensional networks
cost more and run more slowly than low-dimensional networks. A realistic comparison of network
topology must take both wire density and wire length into account.

To account for wire density, we will use bisection width [24] as a measure of network cost. The
bisection width of a network is the minimum number of wires cut when the network is divided
into two equal halves. Rather than comparing networks with constant channel width, $W$, we
will compare networks with constant bisection width. Thus, we will compare low-dimensional
networks that have large $W$ against high-dimensional networks that have small $W$.
Figure 5: A Folded Torus System


The delay of a wire depends on its length, $l$. For short wires, the delay, $t_s$, is limited by charging
the capacitance of the wire and varies logarithmically with wire length.

$$t_s = \tau_{inv}\, e \log_e K l, \tag{4}$$

where $\tau_{inv}$ is the inverter delay, and $K$ is a constant depending on capacitance ratios.
For long wires, delay, $t_l$, is limited by the speed of light.

$$t_l = \frac{l\sqrt{\varepsilon_r}}{c}. \tag{5}$$


In this paper we will consider three delay models: constant delay, $T_c$ independent of length,
logarithmic delay, $T_c \propto \log l$, and linear delay, $T_c \propto l$. Our main result, that latency is minimized
by low-dimensional networks, is supported by all three models.


3  Performance  Analysis 


In this section we compare the performance of unidirectional k-ary n-cube interconnection
networks using the following assumptions:

•  Networks  must  be  embedded  into  the  plane.  If  a  three-dimensional  packaging  technology 
becomes  available,  the  comparison  changes  only  slightly. 
• Nodes are placed systematically by embedding $n/2$ logical dimensions in each of the two
physical dimensions. We assume that both n and k are even integers. The long end-around
connections shown in Figure 3 can be avoided by folding the network as shown in
Figure 5.


• For networks with the same number of nodes, wire density is held constant. Each network
is constructed with the same bisection width, $B$, the total number of wires crossing the
midpoint of the network. To keep the bisection width constant, we vary the width, $W$, of
the communication channels. We normalize to the bisection width of a bit-serial ($W = 1$)
binary n-cube.


•  The  networks  use  wormhole  routing. 


• Channel delay, $T_c$, is a function of wire length, $l$. We begin by considering channel delay
to be constant. Later, the comparison is performed for both logarithmic and linear wire
delays; $T_c \propto \log l$ and $T_c \propto l$.


When k is even, the channels crossing the midpoint of the network are all in the highest
dimension. For each of the $\sqrt{N}$ rows of the network, there are $k^{n/2-1}$ of these channels in each
direction for a total of $2\sqrt{N}\,k^{n/2-1}$ channels. Thus, the bisection width, $B$, of a k-ary n-cube
with $W$-bit wide communication channels is

$$B(k,n) = 2W\sqrt{N}\,k^{n/2-1} = \frac{2WN}{k}. \tag{6}$$


For a binary n-cube, $k = 2$, the bisection width is $B(2,n) = WN$. We set $B$ equal to $N$ to
normalize to a binary n-cube with unit width channels, $W = 1$. The channel width, $W(k,n)$,
of a k-ary n-cube with the same bisection width, $B$, follows from (6):

$$W(k,n) = \frac{k}{2}. \tag{7}$$
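A small Python sketch can check that the two forms of the bisection width agree and that a channel width of $k/2$ restores the normalized bisection $B = N$. The function names are hypothetical, and the sketch assumes the expressions $B = 2W\sqrt{N}k^{n/2-1} = 2WN/k$ and $W = k/2$ described in the text.

```python
import math


def bisection_by_rows(k, n, width):
    """Count channels row by row: 2 * W * sqrt(N) * k^(n/2 - 1)."""
    nodes = k**n
    return 2 * width * math.sqrt(nodes) * k ** (n / 2 - 1)


def bisection_width(k, n, width):
    """Closed form of the same count: 2 * W * N / k."""
    return 2 * width * k**n / k


def normalized_channel_width(k):
    """Channel width that gives B = N, the bit-serial binary n-cube bisection."""
    return k / 2


# A 64-node torus (k = 8, n = 2) gets 4-bit channels for the same bisection
# as a bit-serial binary 6-cube.
w = normalized_channel_width(8)
print(w, bisection_width(8, 2, w), bisection_width(2, 6, 1))
```

Holding the bisection fixed is what lets the low-dimensional network trade its smaller number of crossing channels for wider ones.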


The peak wire density is greater than the bisection width in networks with $n > 2$ because the
lower dimensions contribute to wire density. The maximum density, however, is bounded by

$$D_{max} = 2W\sqrt{N}\sum_{i=0}^{n/2-1} k^i = k\sqrt{N}\sum_{i=0}^{n/2-1} k^i = k\sqrt{N}\left(\frac{k^{n/2}-1}{k-1}\right). \tag{8}$$


A  plot  of  wire  density  as  a  function  of  position  for  one  row  of  a  binary  20-cube  is  shown  in 
Figure  6.  The  density  is  very  low  at  the  edges  of  the  cube  and  quite  dense  near  the  center. 
Figure 6: Wire Density vs. Position for One Row of a Binary 20-Cube

The peak density for the row is 1364 at position 341. Compare this density with the bisection
width of the row, which is 1024. In contrast, a two-dimensional torus has a wire density of
1024 independent of position. One advantage of high-radix networks is that they have a very
uniform wire density. They make full use of available area.
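The density figures quoted for the binary 20-cube can be reproduced with a short Python sketch under one assumption that the text does not state explicitly: the 1024 nodes of a row are laid out in natural binary order, so that dimension $d$ links position $x$ to position $x \oplus 2^d$. The function name `row_wire_density` is hypothetical.

```python
def row_wire_density(dims, channels_per_link=2):
    """Wires crossing each cut of a row that holds `dims` binary dimensions.

    Assumption: nodes sit at positions 0 .. 2**dims - 1 in natural binary
    order. Entry p - 1 of the result is the density at the cut between
    positions p - 1 and p. Each link carries one channel in each direction.
    """
    size = 2**dims
    density = []
    for p in range(1, size):
        crossing = 0
        for d in range(dims):
            block = 2 ** (d + 1)         # dimension d pairs nodes within blocks
            r = p % block                # offset of the cut inside its block
            crossing += min(r, block - r)
        density.append(channels_per_link * crossing)
    return density


density = row_wire_density(10)           # one row of a binary 20-cube
print(max(density), density[341 - 1], density[512 - 1])
```

Under this layout the profile peaks at 1364 near one third of the way along the row and equals 1024 at the midpoint, matching the numbers in the text, while the cut next to the edge carries only 20 wires.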

Each processing node connects to $2n$ channels ($n$ input and $n$ output), each of which is $k/2$ bits
wide. Thus, the number of pins per processing node is

$$N_p = nk. \tag{9}$$

A plot of pin density as a function of dimension for N = 256, 16K and 1M nodes³ is shown
in Figure 7. Low-dimensional networks have the disadvantage of requiring many pins per
processing node. A two-dimensional network with 1M nodes (not shown) requires 2048 pins
and is clearly unrealizable. However, the number of pins decreases very rapidly as the dimension,
n, increases. Even for 1M nodes, a dimension 4 node has only 128 pins. All of the configurations
that give low latency also give a reasonable pin count.
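The pin counts above follow from $N_p = nk$ with $k = N^{1/n}$. A minimal Python sketch, with the hypothetical function name `pins_per_node`, tabulates a few cases:

```python
def pins_per_node(nodes, n):
    """Pins per node: 2n channels, each k/2 bits wide, so n * k in total."""
    k = nodes ** (1.0 / n)   # radix implied by N = k^n (may be non-integer)
    return n * k


for n in (2, 4, 5, 10, 20):
    print(n, round(pins_per_node(2**20, n)))
```

For 1M nodes the count falls from 2048 pins at $n = 2$ to 128 pins at $n = 4$, which illustrates how quickly the pin requirement drops as the dimension increases.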


3.1  Latency 

Latency, $T_l$, is the sum of the latency due to the network and the latency due to the processing
node,

$$T_l = T_{net} + T_{node}. \tag{10}$$

³ 1K = 1024 and 1M = 1K × 1K = 1048576.

Figure 7: Pin Density vs. Dimension for 256, 16K, and 1M Nodes

In this paper we are concerned only with $T_{net}$. Techniques to reduce $T_{node}$ are described in [5]
and [9].

If we select two processing nodes, $P_i$, $P_j$, at random, the average number of channels that must
be traversed to send a message from $P_i$ to $P_j$ is given by

$$D = \left(\frac{k-1}{2}\right) n. \tag{11}$$


The average latency of a k-ary n-cube is calculated by substituting (7) and (11) into (3):

$$T_{net} = T_c\left(\left(\frac{k-1}{2}\right) n + \frac{2L}{k}\right). \tag{12}$$


Figure 8 shows the average network latency, $T_{net}$, as a function of dimension, n, for k-ary
n-cubes with $2^8$ (256), $2^{14}$ (16K), and $2^{20}$ (1M) nodes⁴. The leftmost data point in this
figure corresponds to a torus ($n = 2$) and the rightmost data point corresponds to a binary
n-cube ($k = 2$). This figure assumes constant wire delay, $T_c$, and a message length, $L$, of
150 bits. This choice of message length was based on the analysis of a number of fine-grain
concurrent programs [5]. Although constant wire delay is unrealistic, this figure illustrates that
even ignoring the dependence of wire delay on wire length, low-dimensional networks achieve
lower latency than high-dimensional networks.


⁴ For the sake of comparison we allow radix to take on non-integer values. For some of the dimensions
considered, there is no integer radix, k, that gives the correct number of nodes. In fact, this limitation can be
overcome by constructing a mixed-radix cube.
Figure 8: Latency vs. Dimension for 256, 16K, and 1M Nodes, Constant Delay

The  latency  of  the  tori  on  the  left  side  of  Figure  8  is  limited  almost  entirely  by  distance.  The 
latency  of  the  binary  n-cubes  on  the  right  side  of  the  graph  is  limited  almost  entirely  by  aspect 
ratio.  With  bit  serial  channels,  these  cubes  take  150  cycles  to  transmit  their  messages  across  a 
single  channel. 

In  an  application  that  exploits  locality  of  communication,  the  distance  between  communicating 
objects  is  reduced.  In  such  a  situation,  the  latency  of  the  low-dimensional  networks  (the  left  side 
of  Figure  8)  is  reduced.  High-dimensional  networks,  on  the  other  hand,  cannot  take  advantage 
of  locality.  Their  latency  will  remain  high. 

In  applications  that  send  short  messages,  the  component  of  latency  due  to  message  length  is 
reduced  resulting  in  lower  latency  for  high-dimensional  networks  (the  right  side  of  Figure  8). 

In general the lowest latency is achieved when the component of latency due to distance, $D$,
and the component due to message length, $L/W$, are approximately equal, $D \approx L/W$. For the three
cases shown in Figure 8, minimum latencies are achieved for n = 2, 4, and 5 respectively.
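The quoted minima can be checked numerically. The sketch below assumes the constant-delay latency is the wormhole latency $D + L/W$ with $D = \frac{k-1}{2}n$ and $W = k/2$, takes $T_c = 1$, and allows non-integer radix as the paper does; the function names are hypothetical.

```python
import math


def network_latency(nodes, n, length=150):
    """Constant-delay latency in channel cycles: distance plus aspect ratio."""
    k = nodes ** (1.0 / n)            # radix implied by N = k^n
    distance = (k - 1) / 2 * n        # average hops in a unidirectional cube
    aspect_ratio = length / (k / 2)   # L / W with W = k / 2
    return distance + aspect_ratio


def best_dimension(nodes, length=150):
    """Integer dimension from the torus (n = 2) to the binary n-cube with least latency."""
    max_n = round(math.log2(nodes))
    return min(range(2, max_n + 1),
               key=lambda n: network_latency(nodes, n, length))


for nodes in (2**8, 2**14, 2**20):
    n = best_dimension(nodes)
    print(nodes, n, round(network_latency(nodes, n), 2))
```

For 150-bit messages this search selects n = 2, 4, and 5 for 256, 16K, and 1M nodes, and at each of those points the distance and aspect-ratio terms are of comparable size.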

The longest wire in the system becomes a bottleneck that determines the rate at which each
channel operates, $T_c$. The length of this wire is given by

$$l = k^{n/2-1}. \tag{13}$$

If the wires are sufficiently short, delay depends logarithmically on wire length. If the channels
are longer, they become limited by the speed of light, and delay depends linearly on channel
length. Substituting (13) into (4) and (5) gives

$$T_c \propto \begin{cases} 1 + \log_e l = 1 + \left(\frac{n}{2} - 1\right)\log_e k & \text{logarithmic delay} \\ l = k^{n/2-1} & \text{linear delay.} \end{cases} \tag{14}$$

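We substitute (14) into (12) to get the network latency for these two cases:

$$T_{net} \propto \begin{cases} \left(1 + \left(\frac{n}{2} - 1\right)\log_e k\right)\left(\left(\frac{k-1}{2}\right) n + \frac{2L}{k}\right) & \text{logarithmic delay} \\ k^{n/2-1}\left(\left(\frac{k-1}{2}\right) n + \frac{2L}{k}\right) & \text{linear delay.} \end{cases} \tag{15}$$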

Figure 9 shows the average network latency as a function of dimension for k-ary n-cubes with
$2^8$ (256), $2^{14}$ (16K), and $2^{20}$ (1M) nodes, assuming logarithmic wire delay and a message length,
$L$, of 150. Figure 10 shows the same data assuming linear wire delays. In both figures, the
leftmost data point corresponds to a torus ($n = 2$) and the rightmost data point corresponds to
a binary n-cube ($k = 2$).

In the linear delay case, Figure 10, a torus ($n = 2$) always gives the lowest latency. This
is because a torus offers the highest bandwidth channels and the most direct physical route
between two processing nodes. Under the linear delay assumption, latency is determined solely
by bandwidth and by the physical distance traversed. There is no advantage in having long
channels.

Under the logarithmic delay assumption, Figure 9, a torus has the lowest latency for small
networks (N = 256). For the larger networks, the lowest latency is achieved with slightly higher
dimensions. With N = 16K, the lowest latency occurs when n is three⁵. With N = 1M, the
lowest latency is achieved when n is 5. It is interesting that assuming constant wire delay does
not change this result much. Recall that under the (unrealistic) constant wire delay assumption,
Figure 8, the minimum latencies are achieved with dimensions of 2, 4, and 5 respectively.
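The effect of the delay model on the best dimension can be explored with a short Python sketch. It assumes the latency is proportional to a delay factor times $\frac{k-1}{2}n + \frac{2L}{k}$, where the factor is 1 for constant delay, $1 + (\frac{n}{2} - 1)\log_e k$ for logarithmic delay, and $k^{n/2-1}$ for linear delay. The function names and the string labels for the models are hypothetical.

```python
import math


def delay_factor(k, n, model):
    """Relative channel cycle time for the longest wire, l = k^(n/2 - 1)."""
    if model == "constant":
        return 1.0
    if model == "logarithmic":
        return 1.0 + (n / 2 - 1) * math.log(k)
    if model == "linear":
        return k ** (n / 2 - 1)
    raise ValueError(f"unknown delay model: {model}")


def scaled_latency(nodes, n, model, length=150):
    """Latency up to a constant factor for the chosen delay model."""
    k = nodes ** (1.0 / n)
    return delay_factor(k, n, model) * ((k - 1) / 2 * n + 2 * length / k)


def best_dimension_for_model(nodes, model, length=150):
    """Integer dimension between 2 and log2(N) that minimizes latency."""
    max_n = round(math.log2(nodes))
    return min(range(2, max_n + 1),
               key=lambda n: scaled_latency(nodes, n, model, length))


for model in ("constant", "logarithmic", "linear"):
    print(model, [best_dimension_for_model(2**e, model) for e in (8, 14, 20)])
```

With 150-bit messages the search gives dimensions 2, 3, and 5 under logarithmic delay and dimension 2 throughout under linear delay, in line with the discussion of Figures 9 and 10.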

The results shown in Figures 8 through 10 were derived by comparing networks under the
assumption of constant wire cost to a binary n-cube with $W = 1$. For small networks it is
possible to construct binary n-cubes with wider channels, and for large networks (e.g., 1M
nodes) it may not be possible to construct a binary n-cube at all. The available wiring area
grows as $N^{2/3}$ while the bisection width of a binary n-cube grows as $N$. In the case of small
networks, the comparison against binary n-cubes with wide channels can be performed by
expressing message length in terms of the binary n-cube's channel width, in effect decreasing
the message length for purposes of comparison. The net result is the same: lower-dimensional
networks give lower latency. Even if we perform the 256 node comparison against a binary
n-cube with $W = 16$, the torus gives the lowest latency under the logarithmic delay model,
and a dimension 3 network gives minimum latency under the constant delay model. For large
networks, the available wire is less than assumed, so the effective message length should be
increased, making low-dimensional networks look even more favorable.

⁵ In an actual machine the dimension n would be restricted to be an integer.

Figure 9: Latency vs. Dimension for 256, 16K, and 1M Nodes, Logarithmic Delay

Figure 10: Latency vs. Dimension for 256, 16K, and 1M Nodes, Linear Delay

In this comparison we have assumed that only a single bit of information is in transit on each
wire of the network at a given time. Under this assumption, the delay between nodes, $T_c$, is
equal to the period of each node, $T_p$. In a network with long wires, however, it is possible to
have several bits in transit at once. In this case, the channel delay, $T_c$, is a function of wire
length, while the channel period, $T_p < T_c$, remains constant. Similarly, in a network with very
short wires we may allow a bit to ripple through several channels before sending the next bit.
In this case, $T_p > T_c$. Separating the coefficients, $T_c$ and $T_p$, (3) becomes

$$T_{net} = T_c D + T_p \frac{L}{W}. \tag{16}$$

The net effect of allowing $T_c \neq T_p$ is the same as changing the length, $L$, by a factor of $T_p/T_c$ and
does not change our results significantly.

When  wire  cost  is  considered,  low-dimensional  networks  (e.g.,  tori)  offer  lower  latency  than 
high-dimensional  networks  (e.g.,  binary  n-cubes).  Intuitively,  tori  outperform  binary  n-cubes 
because  they  better  match  form  to  function.  The  logical  and  physical  graphs  of  the  torus  are 
identical; thus, messages always travel the minimum distance from source to destination. In a
binary  n-cube,  on  the  other  hand,  the  fit  between  form  and  function  is  not  as  good.  A  message 
in  a  binary  n-cube  embedded  into  the  plane  may  have  to  traverse  considerably  more  than  the 
minimum  distance  between  its  source  and  destination. 

3.2  Throughput 

Throughput,  another  important  metric  of  network  performance,  is  defined  as  the  total  number 
of  messages  the  network  can  handle  per  unit  time.  One  method  of  estimating  throughput  is  to 
calculate  the  capacity  of  a  network,  the  total  number  of  messages  that  can  be  in  the  network 
at  once.  Typically  the  maximum  throughput  of  a  network  is  some  fraction  of  its  capacity.  The 
network  capacity  per  node  is  the  total  bandwidth  out  of  each  node  divided  by  the  average 
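number of channels traversed by each message. For k-ary n-cubes, the bandwidth out of each
node is $nW$, and the average number of channels traversed is given by (11), so the network
capacity per node is given by

$$\text{capacity per node} \propto \frac{nW}{D} = \frac{n\left(\frac{k}{2}\right)}{\left(\frac{k-1}{2}\right) n} = \frac{k}{k-1} \approx 1. \tag{17}$$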


The  network  capacity  is  independent  of  dimension.  For  a  constant  wire  density,  there  is  a 
constant  network  capacity. 

Throughput  will  be  less  than  capacity  because  contention  causes  some  channels  to  block.  This 
contention  also  increases  network  latency.  To  simplify  the  analysis  of  this  contention,  we  make 
the  following  assumptions: 
Figure 11: Contention Model for A Network

• Messages are routed using e-cube routing (in order of decreasing dimension) [6]. That is,
a message at node $a_0, \ldots, a_{n-1}$ destined for node $b_0, \ldots, b_{n-1}$ is first routed in dimension
$n-1$ until it reaches node $a_0, \ldots, a_{n-2}, b_{n-1}$. The message is then routed in dimension
$n-2$ until it reaches node $a_0, \ldots, a_{n-3}, b_{n-2}, b_{n-1}$, and so on. As shown in Figure 11,
this assumption allows us to consider the contention in each dimension separately.

• The traffic from each node is generated by a Poisson process with arrival rate $\lambda$ (bits/cycle).

• Message destinations are uniformly distributed and independent.

The arrival rate of $\lambda$ (bits/cycle) corresponds to $\lambda_E = \lambda / L$ (messages/cycle). At the destination,
each flit is serviced as soon as it arrives, so the service time at the sink is $T_0 = L/W = 2L/k$.
Starting with $T_0$ we will calculate the service time seen entering each preceding dimension.

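For convenience, we will define the following quantities:

$$\begin{aligned}
\gamma &= \frac{1}{k}, \\
\lambda_S &= \gamma\lambda_E, \\
\lambda_R &= (1-\gamma)\lambda_E, \\
\lambda_{SS} &= \gamma^2\lambda_E, \\
\lambda_{SR} &= \gamma(1-\gamma)\lambda_E, \\
\lambda_{RS} &= \gamma(1-\gamma)\lambda_E, \text{ and} \\
\lambda_{RR} &= (1-\gamma)^2\lambda_E.
\end{aligned} \tag{18}$$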
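A minimal Python sketch of these definitions makes the bookkeeping explicit: with $\gamma = 1/k$, a fraction $\gamma$ of the entering stream skips a dimension and a fraction $1 - \gamma$ is routed in it, and the two-letter rates split each stream again. The function name `stream_rates` and the dictionary keys are hypothetical labels for this example.

```python
def stream_rates(k, entering_rate):
    """Split the entering message rate into skip (S) and route (R) streams.

    gamma = 1/k is the probability that a message skips a dimension.
    Two-letter keys give (previous dimension, this dimension).
    """
    gamma = 1.0 / k
    return {
        "S": gamma * entering_rate,
        "R": (1 - gamma) * entering_rate,
        "SS": gamma**2 * entering_rate,
        "SR": gamma * (1 - gamma) * entering_rate,
        "RS": gamma * (1 - gamma) * entering_rate,
        "RR": (1 - gamma) ** 2 * entering_rate,
    }


rates = stream_rates(k=8, entering_rate=0.01)
print(rates["SS"] + rates["SR"] + rates["RS"] + rates["RR"])  # equals the entering rate
```

The four two-letter components always sum to the entering rate, which is a quick consistency check on the definitions.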

Figure 12: Contention Model for A Single Dimension

Consider a single dimension, $i$, of the network as shown in Figure 12. All messages incur
a latency, $T_E$, due to contention on entering the dimension. Those messages that are routed
incur an additional latency, $T_R$, due to contention during routing. The rate $\lambda_E$ message stream
entering the dimension is composed of two components: a rate $\lambda_S$ stream that skipped the
previous ($(i+1)$st) dimension, and a rate $\lambda_R$ stream that was routed in the previous dimension.
These two streams are in turn split into components that will skip the dimension ($\lambda_{SS}$ and
$\lambda_{RS}$) and components that will be routed in the $i$th dimension ($\lambda_{SR}$ and $\lambda_{RR}$). The entering
latency  seen  by  one  component  (say  Xrr)  is  given  by  multiplying  the  probability  of  a  collision 
(in  this  case  XsaTi+i)  by  the  expected  latency  due  to  a  collision,  (in  this  case  The 

components that require routing must also add the latency due to contention during routing,
$T_{Ri}$. Adding up the four components with appropriate weights gives the following equation for
$T_{i+1}$.


Ti+i  =  Ti  +  (1  -  7)1*  +  7(1  -  l?XE{Ti  +  T*)  +  73(1  -  7 )W-  (19) 

For large $k$, $\gamma$ is small and the latency is approximated by $T_{i+1} \approx T_i + T_R$. For $k = 2$
(binary n-cubes), $T_R = 0$; thus, $T_{i+1} = T_i +$


\ 


To calculate the routing latency, $T_R$, we use the model shown in Figure 13. Given that a message is to be routed in a dimension, the expected number of channels traversed by the message is $k/2$: one entering channel and $\sigma = (k-2)/2$ continuing channels. Thus, the average message rate on channels continuing in the dimension is $\lambda_C = \sigma\lambda_R$. Using virtual channels and e-cube routing, the actual continuing rate on the $j$th channel (outer spiral), $\lambda_{Cj}$, varies with $j$; to calculate $T_R$ we need only the average rate.

The service time in the last continuing channel in dimension $i$ is $T_i$. Once we know the service time for the $j$th channel, $T_{ij}$, the additional service time due to contention at the $(j-1)$st channel is given by multiplying the probability of a collision, $\lambda_R T_{i0}$, by the expected waiting time for a collision, $T_{i0}/2$. Repeating this calculation $\sigma$ times gives us $T_{i0}$.

Figure 13: Contention Model for Routing Latency


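$$T_{i(j-1)} = T_{ij} + \frac{\lambda_R T_{i0}^2}{2},$$

$$T_{i0} = T_i + \frac{\sigma\lambda_R T_{i0}^2}{2} = T_i + \frac{\lambda_C T_{i0}^2}{2}, \qquad (20)$$

$$T_{i0} = \frac{1 - \sqrt{1 - 2\lambda_C T_i}}{\lambda_C}.$$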

Equation (20) is valid only when $\lambda_C < \frac{1}{2T_i}$. If the message rate is higher than this limit, there is no steady-state solution and latency becomes infinite. There are two solutions to (20). Here we consider only the smaller of the two latencies. The larger solution corresponds to a state that is not encountered during normal operation of a network.
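The two solutions and the validity limit follow from treating (20) as a quadratic in $T_{i0}$. The short derivation below assumes (20) reads $T_{i0} = T_i + \lambda_C T_{i0}^2/2$, which is the form consistent with the closed-form root given above.

```latex
\begin{align*}
\frac{\lambda_C}{2}\,T_{i0}^2 - T_{i0} + T_i &= 0 \\
T_{i0} &= \frac{1 \pm \sqrt{1 - 2\lambda_C T_i}}{\lambda_C} \\
\text{real roots} &\iff 1 - 2\lambda_C T_i \ge 0
  \iff \lambda_C \le \frac{1}{2T_i} \\
\lambda_C \to 0:\quad
\frac{1 - \sqrt{1 - 2\lambda_C T_i}}{\lambda_C} &\to T_i, \qquad
\frac{1 + \sqrt{1 - 2\lambda_C T_i}}{\lambda_C} \to \infty
\end{align*}
```

The smaller root reduces to the contention-free service time $T_i$ as traffic vanishes, which is why it is the physically meaningful one.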

To calculate $T_R$ we also need to consider the possibility of a collision on the entering channel.

$$T_R = T_{i0}\left(1 + \frac{\lambda_C T_{i0}}{2}\right) - T_i. \qquad (21)$$


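If sufficient queueing is added to each network node, the service times do not increase, only the latency, and equations (21) and (19) become

$$T_R = \left(\frac{T_0}{1 - \lambda_C T_0/2}\right)\left(1 + \frac{\lambda_C T_0}{2}\right) - T_0, \qquad (22)$$

$$T_{i+1} = T_i + (1-\gamma)T_R + \left(\gamma(1-\gamma)^3 + \gamma^3(1-\gamma)\right)\lambda_E T_0^2. \qquad (23)$$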

To be effective, the total queueing between the source and destination should be greater than the expected increase in latency due to blocking. One or two flits of queueing per stage is usually sufficient. The analysis here is pessimistic in that it assumes no queueing.

Figure 14: Latency vs. Traffic ($\lambda$) for 1K node networks: 32-ary 2-cube, 4-ary 5-cube, and binary 10-cube, $L = 200$ bits. Solid line is predicted latency, points are measurements taken from a simulator.

Using equation (19), we can determine (1) the maximum throughput of the network and (2) how network latency increases with traffic.
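To make the recurrence concrete, here is a minimal sketch that applies (20), (21), and (19) once per dimension, starting from the sink. It rests on several assumptions that should be checked against the full derivation: $\lambda_E = \lambda/L$, $T_0 = L/W$ with channel width $W = k/2$, $\sigma = (k-2)/2$, and a reading of (19) in which the two collision terms carry coefficients $\gamma(1-\gamma)^3$ and $\gamma^3(1-\gamma)$. All function and parameter names are hypothetical.

```python
import math


def routing_latency(t_i, lam_c):
    """Routing latency T_R from (20) and (21); t_i is the downstream service time."""
    disc = 1.0 - 2.0 * lam_c * t_i
    if disc < 0.0:
        return math.inf  # lam_c > 1/(2 T_i): no steady-state solution
    # Smaller root of (20), written as 2*T_i / (1 + sqrt(disc)) to avoid
    # cancellation; algebraically equal to (1 - sqrt(disc)) / lam_c.
    t_i0 = 2.0 * t_i / (1.0 + math.sqrt(disc))
    return t_i0 * (1.0 + lam_c * t_i0 / 2.0) - t_i


def source_service_time(k, n, lam_bits, msg_bits):
    """Service time seen at the source after n applications of (19)."""
    gamma = 1.0 / k
    width = k / 2.0                                   # assumed channel width W
    lam_e = lam_bits / msg_bits                       # messages/cycle
    lam_c = ((k - 2) / 2.0) * (1.0 - gamma) * lam_e   # sigma * lambda_R
    t = msg_bits / width                              # T_0 = L/W at the sink
    for _ in range(n):
        t_r = routing_latency(t, lam_c)
        if math.isinf(t_r):
            return math.inf
        t = (t + (1.0 - gamma) * t_r
             + gamma * (1.0 - gamma) ** 3 * lam_e * (t + t_r) ** 2
             + gamma ** 3 * (1.0 - gamma) * lam_e * t ** 2)
    return t


def blocking_latency(k, n, lam_bits, msg_bits=200):
    """Increase over the contention-free service time L/W, in cycles."""
    return source_service_time(k, n, lam_bits, msg_bits) - msg_bits / (k / 2.0)


for k, n in [(32, 2), (4, 5), (2, 10)]:
    print(k, n, blocking_latency(k, n, lam_bits=0.2))
```

Under these assumptions, `blocking_latency(32, 2, 0.2)` and `blocking_latency(2, 10, 0.2)` come out near 7 and 64 cycles, in line with the contention figures the text quotes for the 32-ary 2-cube and the binary 10-cube at $\lambda = 0.2$. Beyond the limit in (20) the functions return infinity rather than a latency.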

Figures 14 and 15 show how latency increases as a function of applied traffic for 1K node and 4K node $k$-ary $n$-cubes. The vertical axis shows latency in cycles. The horizontal axis is traffic per node, $\lambda$, in bits/cycle. The figures compare measurements from a network simulator (points) to the latency predicted by (21) (lines). The simulation agrees with the prediction within a few percent until the network approaches saturation.

For 1K networks, a 32-ary 2-cube always gives the lowest latency. For 4K networks, a 16-ary 3-cube gives the lowest latency when $\lambda < 0.2$. Because latency increases more slowly for 2-dimensional networks, a 64-ary 2-cube gives the lowest latency when $\lambda > 0.2$.

At the left side of each graph ($\lambda = 0$), latency is given by (12). As traffic is applied to the network, latency increases slowly due to contention in the network until saturation is reached. Saturation occurs when $\lambda$ is between 0.3 and 0.5 depending on the network topology. Networks should be designed to operate on the flat portion of the curve ($\lambda < 0.25$).

Parameter                    1K Nodes         4K Nodes
Dimension                    2, 5
radix                        32               16, 8
Max Throughput               0.41             0.31, 0.31
Latency, $\lambda$ = 0.1     128              55.2, 79.9
Latency, $\lambda$ = 0.2     50.5, 161        70.3, 112
Latency, $\lambda$ = 0.3     59.3, 221        135, 245

Table 1: Maximum Throughput as a Fraction of Capacity and Blocking Latency in Cycles


When  the  network  saturates,  throughput  levels  off  as  shown  in  Figures  16  and  17.  These  figures 
show  how  much  traffic  is  delivered  (vertical  axis)  when  the  nodes  attempt  to  inject  a  given 
amount  of  traffic  (horizontal  axis).  The  curve  is  linear  (actual  =  attempted)  until  saturation 
is  reached.  From  this  point  on,  actual  traffic  is  constant.  This  plateau  occurs  because  (1)  the 
network  is  source  queued,  and  (2)  messages  that  encounter  contention  are  blocked  rather  than 
aborted.  In  networks  where  contention  is  resolved  by  dropping  messages,  throughput  usually 
decreases  beyond  saturation. 

To find the maximum throughput of the network, the source service time, $T_0$, is set equal to the reciprocal of the message rate, $\lambda_E$, and equations (19), (20), and (21) are solved for $\lambda_E$. The maximum throughput as a fraction of capacity for $k$-ary $n$-cubes with 1K and 4K nodes is tabulated in Table 1. Also shown is the total latency for $L = 200$ bit messages at several message rates. The table shows that the additional latency due to blocking is significantly reduced as dimension is decreased.

In networks of constant bisection width, the latency of low-dimensional networks increases more slowly with applied traffic than the latency of high-dimensional networks. At $\lambda = 0.2$, the 32-ary 2-cube has $\approx 1/5$ the latency of the binary 10-cube. At this point, the additional latency due to contention in the 32-ary 2-cube is $7T_c$ compared to $64T_c$ in the binary 10-cube. At moderate loads, low-dimensional networks may outperform higher-dimensional networks with lower zero-load latency. For example, a 16-ary 3-cube has lower zero-load latency than a 64-ary 2-cube (47.5 vs. 69.25). However, the 64-ary 2-cube has lower latency when $\lambda = 0.3$ (78.6 vs. 135).
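The zero-load figures quoted here can be reproduced by hand. The worked arithmetic below assumes zero-load latency of $D + L/W$ cycles with $D = n(k-1)/2$ and $W = k/2$ (forms not restated in this section, but which match the quoted values exactly) and $L = 200$ bits. Subtracting the zero-load value from the latency at $\lambda = 0.3$ then isolates the part due to blocking.

```latex
\begin{align*}
T_{16,3} &= \frac{3(16-1)}{2} + \frac{200}{16/2} = 22.5 + 25 = 47.5 \\
T_{64,2} &= \frac{2(64-1)}{2} + \frac{200}{64/2} = 63 + 6.25 = 69.25 \\
\text{blocking latency at } \lambda = 0.3:\quad
  & 135 - 47.5 = 87.5 \quad (16\text{-ary 3-cube}) \\
  & 78.6 - 69.25 = 9.35 \quad (64\text{-ary 2-cube})
\end{align*}
```

On this reading, the 2-cube pays about 22 more cycles at zero load but roughly 78 fewer cycles of blocking at $\lambda = 0.3$.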

Intuitively, low-dimensional networks handle contention better because they use fewer channels of higher bandwidth and thus get better queueing performance. The shorter service times of these networks result in both a lower probability of collision and a lower expected waiting time in the event of a collision. Thus the blocking latency at each node is reduced quadratically as $k$ is increased. Low-dimensional networks require more hops, $D$, and have a higher rate on the continuing channels, $\lambda_C$. However, messages travel on the continuing channels more frequently than on the entering channels; thus most contention is with the lower rate channels. Having fewer channels of higher bandwidth also improves hot-spot throughput as described below.

3.3  Hot  Spot  Throughput


In many situations traffic is not uniform, but rather is concentrated into hot spots. A hot spot is a pair of nodes that accounts for a disproportionately large portion of the total network traffic. As described by Pfister [16] for a shared-memory computer, hot-spot traffic can degrade performance of the entire network by causing congestion.

The hot-spot throughput of a network is the maximum rate at which messages can be sent from one specific node, $P_i$, to another specific node, $P_j$. For a $k$-ary $n$-cube with deterministic routing, the hot-spot throughput, $\Theta_{HS}$, is just the bandwidth of a single channel, $W$. Thus, under the assumption of constant wire cost we have

$$\Theta_{HS} = W = k/2. \qquad (24)$$

Low-dimensional  networks  have  greater  channel  bandwidth  and  thus  have  greater  hot-spot 
throughput  than  do  high-dimensional  networks.  Intuitively,  low-dimensional  networks  operate 
better  under  non-uniform  loads  because  they  do  more  resource  sharing.  In  an  interconnection 
network  the  resources  are  wires.  In  a  high-dimensional  network,  wires  are  assigned  to  particular 
dimensions  and  cannot  be  shared  between  dimensions.  For  example,  in  a  binary  n-cube  it  is 
possible  for  a  wire  to  be  saturated  while  a  physically  adjacent  wire  assigned  to  a  different 
dimension  remains  idle.  In  a  torus  all  physically  adjacent  wires  are  combined  into  a  single 
channel  that  is  shared  by  all  messages  that  must  traverse  the  physical  distance  spanned  by  the 
channel. 
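For a concrete comparison, the following applies (24) to the three 1K-node networks of Figure 14. It assumes the channel width under constant bisection is $W = k/2$, normalized so that a binary $n$-cube has unit-width channels.

```latex
\begin{align*}
\Theta_{HS}(32\text{-ary 2-cube}) &= 32/2 = 16 \\
\Theta_{HS}(4\text{-ary 5-cube}) &= 4/2 = 2 \\
\Theta_{HS}(\text{binary 10-cube}) &= 2/2 = 1
\end{align*}
```

Under this assumption the torus sustains sixteen times the hot-spot traffic of the binary 10-cube between any fixed pair of nodes.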


4  Conclusion 


Under  the  assumption  of  constant  wire  bisection,  low-dimensional  networks  with  wide  channels 
provide  lower  latency,  less  contention,  and  higher  hot-spot  throughput  than  high-dimensional 
networks with narrow channels. Minimum network latency is achieved when the network radix, $k$, and dimension, $n$, are chosen to make the components of latency due to distance, $D$, and aspect ratio, $L/W$, approximately equal. The minimum latency occurs at a very low dimension, 2 for up to 1024 nodes.
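The claim about the minimum can be explored numerically. The sketch below enumerates every radix and dimension pair with $k^n$ equal to the node count and picks the one with the lowest zero-load latency. It assumes latency of $D + L/W$ cycles with $D = n(k-1)/2$ and $W = k/2$; these forms and all function names are assumptions made for illustration.

```python
def zero_load_latency(k, n, msg_bits):
    """Zero-load latency in cycles under the assumed D + L/W model."""
    distance = n * (k - 1) / 2.0   # assumed average hop count D
    width = k / 2.0                # assumed channel width W
    return distance + msg_bits / width


def radix_dimension_pairs(num_nodes):
    """Return all (k, n) with k >= 2 and k**n == num_nodes."""
    pairs = []
    for k in range(2, num_nodes + 1):
        size, n = k, 1
        while size < num_nodes:
            size *= k
            n += 1
        if size == num_nodes:
            pairs.append((k, n))
    return pairs


def best_network(num_nodes, msg_bits):
    """Pick the (k, n) pair with the lowest zero-load latency."""
    return min(radix_dimension_pairs(num_nodes),
               key=lambda kn: zero_load_latency(kn[0], kn[1], msg_bits))


for k, n in radix_dimension_pairs(1024):
    print(k, n, zero_load_latency(k, n, 200))
print("best:", best_network(1024, 200))
```

For 1024 nodes and 200-bit messages the candidates are the binary 10-cube, the 4-ary 5-cube, the 32-ary 2-cube, and a 1024-node ring; under the assumed model the 32-ary 2-cube has the lowest latency.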

Low  dimensional  networks  reduce  contention  because  having  a  few  high-bandwidth  channels 
results  in  more  resource  sharing  and  thus  better  queueing  performance  than  having  many  low- 
bandwidth  channels.  While  network  capacity  and  worst-case  blocking  latency  are  independent 
of  dimension,  low-dimensional  networks  have  a  higher  maximum  throughput  and  lower  average 
blocking  latency  than  do  high-dimensional  networks.  Improved  resource  sharing  also  gives 
low-dimensional  networks  higher  hot-spot  throughput  than  high-dimensional  networks. 

The  results  of  this  paper  have  all  been  made  under  the  assumption  of  constant  channel  delay, 
independent  of  channel  length.  The  main  result,  that  low-dimensional  networks  give  minimum 
latency,  however,  does  not  change  appreciably  when  logarithmic  or  linear  delay  models  are 
considered. In choosing a delay model one must consider how the delay of a switching node compares to the delay of a wire. Current VLSI routing chips [7] have delays of tens of nanoseconds, enough time to drive several meters of wire. For such systems a constant delay model is
adequate.  As  chips  get  faster  and  systems  get  larger,  however,  a  linear  delay  model  will  more 
accurately  reflect  system  performance. 

Fat-tree  networks  have  been  shown  to  be  universal  in  the  sense  that  they  can  efficiently  simulate 
any  other  network  of  the  same  volume  [13].  However,  the  analysis  of  these  networks  has  not 
considered latency. $k$-ary $n$-cubes with appropriately chosen radix and dimension are also universal in this sense. A detailed proof is beyond the scope of this paper. Intuitively, one
cannot  do  any  better  than  to  fill  each  of  the  three  physical  dimensions  with  wires  and  place 
switches  at  every  point  of  intersection.  Any  point-to-point  network  can  be  embedded  into  such 
a  3-D  mesh  with  no  more  than  a  constant  increase  in  wiring  length. 

This paper has considered only direct networks [17]. The results do not apply to indirect networks. The depth and the switch degree of an indirect network are analogous to the dimension and radix of a direct network. However, the bisection width of an indirect network is independent of switch degree. Because indirect networks do not exploit locality it is not possible to trade off diameter for bandwidth. There is little reason to construct an indirect network. A high-bandwidth direct network would provide the same function with increased performance.

Low-dimensional $k$-ary $n$-cubes provide a very general communication medium for digital systems. These networks have been developed primarily for message-passing concurrent computers.
They  could  also  be  used  in  place  of  a  bus  or  indirect  network  in  a  shared-memory  concurrent 
computer,  in  place  of  a  bus  to  connect  the  components  of  a  sequential  computer,  or  to  connect 
subsystems  of  a  special  purpose  digital  system.  With  VLSI  communication  chips  the  cost  of 
implementing  a  network  node  is  comparable  to  the  cost  of  interfacing  to  a  shared  bus,  and  the 
performance  of  the  network  is  considerably  greater  than  the  performance  of  a  bus. 

The Torus Routing Chip (TRC) is a VLSI chip designed to implement low-dimensional $k$-ary $n$-cube interconnection networks [7]. The TRC performs wormhole routing in arbitrary $k$-ary $n$-cube interconnection networks. This self-timed chip was functional on first silicon. A single TRC provides 8-bit data channels in two dimensions and can be cascaded to add more dimensions or wider data channels. A TRC network can deliver a 150-bit message in a 1024 node 32-ary 2-cube with an average latency of 7.5 $\mu$s, an order of magnitude better performance than would be achieved by a binary $n$-cube with bit-serial channels. A new routing chip, the Network Design Frame (NDF), currently under development, is expected to improve this latency to $\approx 1\,\mu$s.

Now that the latency of communication networks has been reduced to a few microseconds, the latency of the processing nodes dominates the overall latency. To efficiently make use of a low-latency communication network we need a processing node that interprets messages with very little overhead. The design of such a message-driven processor is currently underway [5].

The real challenge in concurrent computing is software. The development of concurrent software is strongly influenced by available concurrent hardware. We hope that by providing machines with higher performance internode communication we will encourage concurrency to be exploited at a finer grain size in both system and application software.